Evaluating Long-Context Reasoning in LLM-Based WebAgents
Chung, Andy, Zhang, Yichi, Lin, Kaixiang, Rawal, Aditya, Gao, Qiaozi, Chai, Joyce
As large language model (LLM)-based agents become increasingly integrated into daily digital interactions, their ability to reason across long interaction histories becomes crucial for providing personalized and contextually aware assistance. However, the performance of these agents in long context scenarios, particularly for action-taking WebAgents operating in realistic web environments, remains largely unexplored. This paper introduces a benchmark for evaluating long context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieval and application of information from extended interaction histories. We develop a novel evaluation framework that simulates multi-session user interactions by injecting irrelevant task trajectories between dependent subtasks, creating contexts ranging from 25,000 to 150,000 tokens. Through extensive evaluation of four popular models, Claude-3.7, GPT-4.1, Llama 4, and o4-mini, we observe a dramatic performance degradation as context length increases, with success rates dropping from 40-50% in baseline conditions to less than 10% in long context scenarios. Our detailed error analysis reveals that agents primarily fail due to getting stuck in loops and losing track of original task objectives. We further propose an implicit RAG approach that provides modest improvements by generating task-relevant summaries, though fundamental limitations in long context reasoning persist. These findings highlight critical challenges for deploying WebAgents in realistic, long-term user interaction scenarios and provide insights for developing more robust agent architectures capable of maintaining coherent task execution across extended contexts.
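The evaluation framework described above can be illustrated with a minimal sketch: interleave irrelevant task trajectories between two dependent subtasks until the combined history reaches a target token budget. All names here (`build_long_context`, the toy token counter) are illustrative assumptions, not the paper's code.

```python
# Hypothetical sketch of the context-construction idea: pad the history
# between two dependent subtasks with distractor trajectories so the agent
# must retrieve information from far back in its context.

def build_long_context(subtask_a, subtask_b, distractors, target_tokens, count_tokens):
    """Insert distractor trajectories between dependent subtasks until the
    combined context reaches roughly target_tokens."""
    context = list(subtask_a)
    for traj in distractors:
        if count_tokens(context) >= target_tokens:
            break
        context.extend(traj)
    context.extend(subtask_b)
    return context

# Toy usage with a whitespace "token" counter standing in for a real tokenizer.
toy_count = lambda msgs: sum(len(m.split()) for m in msgs)
ctx = build_long_context(
    ["book flight to Paris", "confirmation #123"],          # subtask A (holds the needed fact)
    ["use my earlier confirmation number"],                 # subtask B (depends on A)
    [["search shoes", "add to cart"]] * 50,                 # irrelevant trajectories
    target_tokens=40,
    count_tokens=toy_count,
)
```

In the real benchmark the budget would be 25k-150k tokens under a model tokenizer; the structure (dependent fact at the start, retrieval demand at the end) is the point of the sketch.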
WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents [Technical Report]
Peeters, Ralph, Steiner, Aaron, Schwarz, Luca, Caspary, Julian Yuya, Bizer, Christian
LLM-based web agents have the potential to automate long-running web tasks, such as searching for products in multiple e-shops and subsequently ordering the cheapest products that meet the user's needs. Benchmarks for evaluating web agents either require agents to perform tasks online using the live Web or offline using simulated environments, which allow for the exact reproduction of the experimental setup. While DeepShop provides an online benchmark that requires agents to perform challenging shopping tasks, existing offline benchmarks such as WebShop, WebArena, or Mind2Web cover only comparatively simple e-commerce tasks that need to be performed against a single shop containing product data from a single source. What is missing is an e-commerce benchmark that simulates multiple shops containing heterogeneous product data and requires agents to perform complex tasks. We fill this gap by introducing WebMall, the first offline multi-shop benchmark for evaluating web agents on challenging comparison shopping tasks. WebMall consists of four simulated shops populated with product data extracted from the Common Crawl. The WebMall tasks range from specific product searches and price comparisons to advanced queries for complementary or substitute products, as well as checkout processes. We validate WebMall using eight agents that differ in observation space, availability of short-term memory, and the employed LLM. The validation highlights the difficulty of the benchmark, with even the best-performing agents achieving task completion rates below 55% in the task categories cheapest product search and vague product search.
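The core cheapest-product-search subproblem can be sketched in a few lines. This is an illustrative stand-in, not WebMall code: the shop catalogs and the naive term-matching rule are assumptions for demonstration.

```python
# Illustrative sketch of cross-shop comparison shopping: find the
# lowest-priced offer matching a query across shops whose product data
# may differ (the heterogeneity the benchmark is built around).

def cheapest_offer(query, shops):
    """Return (shop, product) for the lowest-priced product whose title
    contains every query term; None if no shop has a match."""
    best = None
    for shop, products in shops.items():
        for p in products:
            title = p["title"].lower()
            if all(term in title for term in query.lower().split()):
                if best is None or p["price"] < best[1]["price"]:
                    best = (shop, p)
    return best

# Toy catalogs for two shops carrying overlapping products.
shops = {
    "shop_a": [{"title": "USB-C Hub 7-in-1", "price": 39.99}],
    "shop_b": [{"title": "USB-C Hub 7-in-1", "price": 34.50},
               {"title": "HDMI Cable", "price": 9.99}],
}
```

An agent solving this on live shop frontends must additionally navigate pagination and reconcile differently formatted listings, which is where the benchmark's difficulty comes from.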
Fara-7B: An Efficient Agentic Model for Computer Use
Awadallah, Ahmed, Lara, Yash, Magazine, Raghav, Mozannar, Hussein, Nambi, Akshay, Pandya, Yash, Rajeswaran, Aravind, Rosset, Corby, Taymanov, Alexey, Vineet, Vibhav, Whitehead, Spencer, Zhao, Andrew
Progress in computer use agents (CUAs) has been constrained by the absence of large and high-quality datasets that capture how humans interact with a computer. While LLMs have thrived on abundant textual data, no comparable corpus exists for CUA trajectories. To address these gaps, we introduce FaraGen, a novel synthetic data generation system for multi-step web tasks. FaraGen can propose diverse tasks from frequently used websites, generate multiple solution attempts, and filter successful trajectories using multiple verifiers. It achieves high throughput, yield, and diversity for multi-step web tasks, producing verified trajectories at approximately $1 each. We use this data to train Fara-7B, a native CUA model that perceives the computer using only screenshots, executes actions via predicted coordinates, and is small enough to run on-device. We find that Fara-7B outperforms other CUA models of comparable size on benchmarks like WebVoyager, Online-Mind2Web, and WebTailBench -- our novel benchmark that better captures under-represented web tasks in pre-existing benchmarks. Furthermore, Fara-7B is competitive with much larger frontier models, illustrating key benefits of scalable data generation systems in advancing small efficient agentic models. We are making Fara-7B open-weight on Microsoft Foundry and HuggingFace, and we are releasing WebTailBench.
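The propose-attempt-filter loop that FaraGen is described as performing can be sketched abstractly. The solver and verifier callables below are stand-ins; none of these names come from the paper.

```python
# Hedged sketch of a generate-and-filter data pipeline: propose tasks,
# collect several solution attempts per task, and keep only trajectories
# that every verifier accepts.

def filter_trajectories(tasks, solve, verifiers, attempts=3):
    """Keep, per task, the first attempted trajectory accepted by all verifiers."""
    kept = []
    for task in tasks:
        for i in range(attempts):
            traj = solve(task, seed=i)
            if all(v(task, traj) for v in verifiers):
                kept.append((task, traj))
                break  # one verified trajectory per task is enough here
    return kept

# Toy solver that only succeeds on its second attempt (seed == 1),
# showing why multiple attempts per task raise yield.
solve = lambda task, seed: {"task": task, "ok": seed == 1, "seed": seed}
verifiers = [lambda t, tr: tr["ok"]]
```

In the real system the attempts would be full browser rollouts and the verifiers multiple independent checks, which is what makes the reported ~$1-per-verified-trajectory cost figure meaningful.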
GUIrilla: A Scalable Framework for Automated Desktop UI Exploration
Garkot, Sofiya, Shamrai, Maksym, Synytsia, Ivan, Hirna, Mariya
Autonomous agents capable of operating complex graphical user interfaces (GUIs) have the potential to transform desktop automation. While recent advances in large language models (LLMs) have significantly improved UI understanding, navigating full-window, multi-application desktop environments remains a major challenge. Data availability is limited by costly manual annotation, closed-source datasets and surface-level synthetic pipelines. We introduce GUIrilla, an automated scalable framework that systematically explores applications via native accessibility APIs to address the critical data collection challenge in GUI automation. Our framework focuses on macOS - an ecosystem with limited representation in current UI datasets - though many of its components are designed for broader cross-platform applicability. GUIrilla organizes discovered interface elements and crawler actions into hierarchical GUI graphs and employs specialized interaction handlers to achieve comprehensive application coverage. Using the application graphs from GUIrilla crawler, we construct and release GUIrilla-Task, a large-scale dataset of 27,171 functionally grounded tasks across 1,108 macOS applications, each annotated with full-desktop and window-level screenshots, accessibility metadata, and semantic action traces. Empirical results show that tuning LLM-based agents on GUIrilla-Task significantly improves performance on downstream UI tasks, outperforming synthetic baselines on the ScreenSpot Pro benchmark while using 97% less data. We also release macapptree, an open-source library for reproducible collection of structured accessibility metadata, along with the full GUIrilla-Task dataset, the manually verified GUIrilla-Gold benchmark, and the framework code to support open research in desktop autonomy.
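Organizing discovered interface elements into a hierarchical GUI graph, as the abstract describes, amounts to grouping accessibility nodes by their parent-child relations. The element schema below (`id`, `role`, `parent`) is an assumed simplification, not the GUIrilla or macapptree data model.

```python
# Minimal sketch: build a hierarchical GUI graph from flat accessibility
# nodes. Roles mimic macOS accessibility roles (AXWindow, AXButton, ...).
from collections import defaultdict

def build_gui_graph(elements):
    """elements: iterable of dicts with 'id', 'role', and optional 'parent'.
    Returns (roots, children) where children maps a node id to its child ids."""
    children = defaultdict(list)
    roots = []
    for el in elements:
        if el.get("parent") is None:
            roots.append(el["id"])
        else:
            children[el["parent"]].append(el["id"])
    return roots, dict(children)

# A tiny window with two interactive children, as a crawler might report it.
elements = [
    {"id": "win", "role": "AXWindow", "parent": None},
    {"id": "btn", "role": "AXButton", "parent": "win"},
    {"id": "txt", "role": "AXTextField", "parent": "win"},
]
```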
WebDART: Dynamic Decomposition and Re-planning for Complex Web Tasks
Yang, Jingbo, Hou, Bairu, Wei, Wei, Chang, Shiyu, Bao, Yujia
Large-language-model (LLM) agents are becoming competent at straightforward web tasks, such as opening an item page or submitting a form, but still struggle with objectives that require long-horizon navigation, large-scale information extraction, and reasoning under constraints. This paper introduces WebDART, a general framework that enables a single LLM to handle such complex chores. WebDART (i) dynamically decomposes each objective into three focused sub-tasks--navigation, information extraction, and execution--so the model concentrates on one skill at a time, and (ii) continuously re-plans the decomposition as new webpages are revealed, taking advantage of newly discovered filters or shortcuts and avoiding redundant exploration. LLM-powered web agents have recently shown promising abilities in web navigation tasks (Drouin et al., 2024; He et al., 2024; Wei et al., 2025; Yang et al., 2024a; Pan et al., 2024; Song et al., 2024). Benchmarks such as WebArena (Zhou et al., 2023) demonstrate that these agents achieve reasonable accuracy on simple objectives, highlighting their potential as general-purpose automation tools. However, when objectives require more complex reasoning and multi-step exploration, agent performance often collapses. As shown in Figure 1, on WebChoreArena (Miyai et al., 2025), a benchmark designed to test higher-complexity web tasks, agents powered by GPT-4o achieve only 8.0% accuracy on tasks across different web domains, far below the 46.6% accuracy on WebArena. This gap highlights a critical weakness of current workflows: while sufficient for simple goals, they are not well equipped for tasks that demand multi-step reasoning, long-horizon navigation, and structured information processing. A closer examination reveals that the difficulty arises from cognitive overload: complex tasks require agents to simultaneously navigate across multiple web pages, extract and track large amounts of information, and reason under constraints.
Consider the following task from WebChoreArena (Miyai et al., 2025): "Tell me the top 3 products with the highest number of reviews in Home Audio of Electronics within the price range of $1,000 to $9,999". As illustrated in Figure 1, product information is distributed across multiple nested web pages. Each page may contain tens of products with attributes such as price and number of reviews.
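The decompose-then-re-plan behavior described above can be sketched as a simple control loop: steps are tagged with one of the three sub-task types, and discovering a filter on a newly revealed page triggers a re-plan that replaces per-page extraction with a single filtered query. Everything here (the step tuples, `page_has_filter`) is a hypothetical simplification, not the WebDART implementation.

```python
# Hypothetical sketch of a decomposition + re-planning loop over
# ('navigate' | 'extract' | 'execute', arg) steps.

def run_webdart_loop(plan, page_has_filter, max_steps=10):
    """Walk the plan; when a navigated page reveals a filter, drop the
    remaining manual extraction steps and substitute one filtered extract."""
    log = []
    steps = list(plan)
    while steps and len(log) < max_steps:
        kind, arg = steps.pop(0)
        log.append((kind, arg))
        if kind == "navigate" and page_has_filter(arg):
            # re-plan: the discovered filter makes page-by-page extraction redundant
            steps = [s for s in steps if s[0] != "extract"]
            steps.insert(0, ("extract", "apply price filter $1,000-$9,999"))
    return log
```

For the review-counting task above, this is the difference between scraping tens of products across nested pages and issuing one constrained query, which is the "shortcut" the abstract refers to.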
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Lù, Xing Han, Kazemnejad, Amirhossein, Meade, Nicholas, Patel, Arkil, Shin, Dongchan, Zambrano, Alejandra, Stańczak, Karolina, Shaw, Peter, Pal, Christopher J., Reddy, Siva
Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io
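The underreporting failure mode the abstract identifies can be made concrete with a toy contrast between a strict rule and a more flexible judge. This is an illustrative sketch, not AgentRewardBench code; the URL, text, and the stand-in "judge" heuristic are invented for demonstration.

```python
# Illustrative sketch: a rule-based exact-URL check rejects a trajectory
# that a more flexible judge (here a text heuristic standing in for an
# LLM judge) would correctly count as a success.

def rule_based_success(trajectory, expected_final_url):
    """Strict rule: success only if the agent ended on the exact expected URL."""
    return trajectory["final_url"] == expected_final_url

def judge_success(trajectory):
    """Stand-in judge: accept if the confirmation actually appeared,
    regardless of which equivalent URL the agent ended on."""
    return "order confirmed" in trajectory["final_text"].lower()

# The agent landed on an equivalent order page, not the URL the rule expects.
traj = {"final_url": "https://shop.example/orders/42",
        "final_text": "Order confirmed. Thank you!"}
```

The gap between the two outcomes on the same trajectory is exactly the underreporting the benchmark measures; a real judge would of course be an LLM prompted with the full trajectory rather than a substring check.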